DATA SCIENCE SESSIONS VOL. 3

A Foundational Python Data Science Course

Session 17: Generalized Linear Models II. Binomial Logistic Regression and ROC analysis. Regularization of BLR.

← Back to course webpage

Feedback should be sent to goran.milovanovic@datakolektiv.com.

These notebooks accompany the DATA SCIENCE SESSIONS VOL. 3 :: A Foundational Python Data Science Course.

Lecturers

Goran S. Milovanović, PhD, DataKolektiv, Chief Scientist & Owner

Aleksandar Cvetković, PhD, DataKolektiv, Consultant

Ilija Lazarević, MA, DataKolektiv, Consultant


What do we want to do today?

In this session, we continue to develop the Binomial Logistic Regression model and introduce ROC (Receiver Operating Characteristic) analysis for classification problems. From ROC analysis, we will learn a lot about metrics for evaluating classification models: the True Positive, False Positive, True Negative, and False Negative rates; Precision and Recall; and the F1 score. We will learn how to plot a model's ROC curve and analyse it to determine the optimal decision threshold for any given Binomial Logistic Regression model.

We will optimize the Binomial Logistic model directly using an optimization algorithm from the SciPy module, which completes our theoretical study of this problem.

We then introduce additional model indicators that take model complexity into account, such as the Akaike Information Criterion (AIC). Finally, we regularize the Binomial Logistic Regression model and learn how to perform Binomial Logistic Regression in scikit-learn.

1. Binomial Logistic Regression, ROC analysis.

Target: Predict churn from all numeric predictors

The ROC analysis

True Positives (Hits): data say 1 and prediction is 1

False Positives (False Alarms): data say 0 and prediction is 1

True Negatives (Correct Rejections): data say 0 and prediction is 0

False Negatives (Misses): data say 1 and prediction is 0

True Positive Rate + False Negative Rate = 1 (Hit Rate + Miss Rate)

False Positive Rate + True Negative Rate = 1 (False Alarm Rate + Correct Rejection Rate)
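The four outcomes above can be tallied directly from observed vs. predicted binary labels. Here is a minimal sketch; the function name and the toy labels are ours, not from the course materials:

```python
import numpy as np

def roc_rates(y_true, y_pred):
    """Count the four classification outcomes and derive TPR and FPR."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))  # Hits
    fp = np.sum((y_true == 0) & (y_pred == 1))  # False Alarms
    tn = np.sum((y_true == 0) & (y_pred == 0))  # Correct Rejections
    fn = np.sum((y_true == 1) & (y_pred == 0))  # Misses
    tpr = tp / (tp + fn)  # True Positive Rate (Hit Rate)
    fpr = fp / (fp + tn)  # False Positive Rate (False Alarm Rate)
    return tp, fp, tn, fn, tpr, fpr

print(roc_rates([1, 1, 0, 0, 1, 0, 1, 0],
                [1, 0, 0, 1, 1, 0, 1, 0]))
```

Note that TPR and FNR (and likewise FPR and TNR) are computed on disjoint subsets of the data: the actual positives and the actual negatives, respectively.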

The Akaike Information Criterion (AIC)

The Akaike Information Criterion (AIC) is a statistical measure used to evaluate the goodness-of-fit of a model. It is based on the principle of parsimony, which states that simpler models should be preferred over more complex ones, all else being equal.

The AIC is defined as follows:

$$AIC = -2\ln(\mathcal{L}) + 2k $$

where $\mathcal{L}$ is the model likelihood and $k$ is the number of parameters in the model.

The AIC penalizes models with more parameters by adding the penalty term $2k$ to the deviance $-2\ln(\mathcal{L})$. This penalty term grows with the number of parameters, and hence it discourages overfitting and encourages simpler models.

The AIC can be used to compare different models and select the best one based on their AIC values. The model with the lowest AIC value is preferred, as it strikes a good balance between goodness-of-fit and simplicity.
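The AIC formula above is a one-liner in code. The log-likelihood values below are made up for illustration; they are not computed from our churn data:

```python
def aic(log_likelihood, k):
    """Akaike Information Criterion: -2*ln(L) + 2k."""
    return -2 * log_likelihood + 2 * k

# A richer model fits slightly better but pays a complexity penalty:
aic_simple = aic(log_likelihood=-120.0, k=3)   # 246.0
aic_complex = aic(log_likelihood=-119.0, k=8)  # 254.0
print(aic_simple, aic_complex)  # the simpler model has the lower AIC here
```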

Model effect: comparison to the Null Model

In Binomial Logistic Regression, the null model is a model with no predictor variables, meaning that it only includes an intercept term. The null model predicts the probability of the outcome variable based on its overall frequency in the sample, without taking any other variables into account.

Model Log-Likelihood

Null Model Log-Likelihood

The Likelihood Ratio Chi-Squared statistic (sometimes termed $G$) follows the $\chi^2$ distribution.
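As a sketch, the model-vs-null comparison can be carried out with scipy.stats.chi2. The two log-likelihood values below are illustrative placeholders, not values from our churn models:

```python
from scipy.stats import chi2

ll_model = -180.0  # log-likelihood of the fitted model (hypothetical)
ll_null = -200.0   # log-likelihood of the intercept-only null model (hypothetical)
k = 5              # number of predictors = degrees of freedom of the test

# Likelihood Ratio Chi-Squared statistic: G = -2*(LL_null - LL_model)
G = 2 * (ll_model - ll_null)
p_value = chi2.sf(G, df=k)  # survival function: P(chi2_k >= G)
print(G, p_value)
```

A small p-value means the fitted model improves significantly upon the null model.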

Target: Predict churn from all the predictors

ROC Analysis Elaborated

The ROC plot can thus also be understood as a $1-Specificity$ vs $Sensitivity$ plot (because $FPR = 1 - Specificity$). In mathematical statistics, a Type I Error indicates a situation in which our statistical model predicts something which is not the case in the population (that is your $\alpha$ level in statistical analyses): this concept is completely mapped by our $FPR$ or False Alarm Rate. On the other hand, Statistical Power is the probability with which a statistical model (or test) successfully recovers an occurrence in the population: and this is perfectly matched by our understanding of $TPR$ or Hit Rate. Thus, the ROC also plots the Type I Error against Statistical Power.
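The ROC curve itself is just the set of $(FPR, TPR)$ points obtained by sweeping the decision threshold; sklearn's roc_curve computes them directly. The toy scores below are hypothetical predicted probabilities, not output from our model:

```python
from sklearn.metrics import roc_curve

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # hypothetical predicted probabilities
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr)  # False Positive Rate at each threshold
print(tpr)  # True Positive Rate at each threshold
```

The curve always runs from $(0, 0)$ (predict nothing positive) to $(1, 1)$ (predict everything positive).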

Precision (or Positive Predictive Value (PPV)) and False Discovery Rate (FDR)

$$PPV = \frac{TP}{TP+FP}=1-FDR$$

The classifier's Precision is the ratio of True Positives to the sum of True Positives and False Positives: the ratio of correct ("relevant") classifications to the number of positive classifications made.

$$FDR = \frac{FP}{FP+TP}=1-PPV$$

The classifier's False Discovery Rate is the ratio of False Positives to the sum of True Positives and False Positives: the ratio of incorrect ("irrelevant") classifications to the number of positive classifications made.
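The complementarity of the two definitions above is easy to verify numerically; the counts of True and False Positives below are hypothetical:

```python
TP, FP = 30, 10        # hypothetical counts
PPV = TP / (TP + FP)   # Precision: 0.75
FDR = FP / (FP + TP)   # False Discovery Rate: 0.25
print(PPV, FDR, PPV + FDR)  # PPV + FDR = 1 always
```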

Accuracy and Balanced Accuracy

In cases of highly imbalanced classes in binary classification, Accuracy can give us a dangerously misleading result:

$$Accuracy = \frac{TP+TN}{TP+TN+FP+FN}$$

Balanced Accuracy ($bACC$) can be used to correct for class imbalance:

$$bACC=\frac{TPR+TNR}{2}$$

To understand how $bACC$ works, consider the following case: we have a dataset with 100 observations, of which 75 are of class $C_1$ and 25 of class $C_2$; hence, a model that always predicts $C_1$ and never $C_2$ must be accurate 75% of the time, right? However, its $bACC$ is only .5 (because its $TPR=1$ and $TNR=0$).
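We can verify this worked example with sklearn's metric implementations, coding $C_1$ as 1 and $C_2$ as 0:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = np.array([1] * 75 + [0] * 25)  # 75 of class C1 (coded 1), 25 of C2
y_pred = np.ones(100, dtype=int)        # the model always predicts C1

print(accuracy_score(y_true, y_pred))           # 0.75
print(balanced_accuracy_score(y_true, y_pred))  # 0.5
```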

The $F_1$ score

This is traditionally used to assess how good a classifier is:

$$F_1 = 2\frac{Precision\cdot Recall}{Precision+Recall}$$

and is also known as F-measure or balanced F-score.

Note.

$$Precision = \frac{TP}{TP+FP}$$

$$Recall = \frac{TP}{TP+FN}$$

We can thus use a more general $F$-score, $F_\beta$, where $\beta$ determines how many times recall is weighted as heavily as precision:

$$F_\beta = (1+\beta^2)\frac{Precision\cdot Recall}{\beta^2 \cdot Precision+Recall}$$
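Both scores are available in sklearn; with $\beta=2$, recall counts twice as much as precision. The toy labels below are ours:

```python
from sklearn.metrics import f1_score, fbeta_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1, 1, 0]  # Precision = 3/5, Recall = 3/4

print(f1_score(y_true, y_pred))               # same as fbeta_score with beta=1
print(fbeta_score(y_true, y_pred, beta=2.0))  # favours recall over precision
```

Since recall (0.75) exceeds precision (0.6) here, $F_2$ comes out higher than $F_1$.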

Area Under the Curve (AUC)

This is probably the most frequently used indicator of model fit in classification. Given that the ideal classifier sits in the top left corner of the ROC plot, i.e. where $TPR=1$ and $FPR=0$, the best model in a comparison is the one with the greatest area under the ROC curve.
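sklearn computes the AUC from true labels and predicted scores; the toy scores below are hypothetical predicted probabilities:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # hypothetical predicted probabilities
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```

AUC has a useful probabilistic reading: it is the probability that a randomly chosen positive receives a higher score than a randomly chosen negative (here, 3 of the 4 positive-negative pairs are ranked correctly).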

Balanced Accuracy to correct for class imbalance: $bACC=\frac{TPR+TNR}{2}$

Precision (or Positive Predictive Value (PPV)): $PPV = \frac{TP}{TP+FP}=1-FDR$

The model Recall is simply the Hit rate (TPR):

The $F_1$ score

$$F_1 = 2\frac{Precision\cdot Recall}{Precision+Recall}$$

2. BLR using scikit-learn

sklearn: Predicting churn from numerical predictors

Predicting with numerical variables only:
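The churn dataset itself is not reproduced here, so as a sketch of the sklearn workflow we let make_classification stand in for it; the split sizes and random seeds are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the numeric churn predictors
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc = model.score(X_test, y_test)  # mean accuracy on held-out data
print(acc)
```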

sklearn: Predicting churn from all the predictors

3. MLE for Binomial Logistic Regression

Say we have observed the following data: $HHTHTTHHHT$. Assume that we know the parameter $p_H$. We can then compute the Likelihood function $\mathcal{L}(p_H|HHTHTTHHHT)$ exactly as we did before. Now, this is the general form of the Binomial Likelihood (where $Y$ stands for the observed data):

$$\mathcal{L}(p|Y) = p_1^y(1-p_1)^{n-y}$$

where $y$ is the number of successes and $n$ is the total number of observations. For each observed data point we then have

$$\mathcal{L}(p|y_i) = p_1^{y_i}(1-p_1)^{\bar{y_i}}$$

where ${y_i}$ is the observed value of the outcome, $Y$, and $\bar{y_i}$ is its complement (e.g. $1$ for $0$ and $0$ for $1$). This form just determines which value will be used in the computation of the Likelihood function at each observed data point $y_i$: it will be either $p_1$ or $1-p_1$. The likelihood function for a given value of $p_1$ for the whole dataset is computed by multiplying the values of $\mathcal{L}(p|y_i)$ across the whole dataset (remember that multiplication in Probability is what conjunction is in Logic and Algebra).
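A quick worked example with the coin data $HHTHTTHHHT$ above (6 heads, 4 tails): multiplying the per-observation terms shows that $p_H = 0.6$, the sample proportion of heads, yields a higher likelihood than $p_H = 0.5$:

```python
import numpy as np

y = np.array([1, 1, 0, 1, 0, 0, 1, 1, 1, 0])  # HHTHTTHHHT, with H = 1, T = 0

def likelihood(p, y):
    # Product of p for each head and (1 - p) for each tail
    return np.prod(np.where(y == 1, p, 1 - p))

print(likelihood(0.6, y))  # 0.6**6 * 0.4**4
print(likelihood(0.5, y))  # 0.5**10
```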

Q: But... how do we get to $p_1$, the parameter value that we will use at each data point?

A: We will search the parameter space of linear coefficients $\beta_0, \beta_1, \ldots, \beta_k$ in our Binomial Logistic Model, computing $p_1$ every time, and compute the likelihood function from it! In other words: we will search the parameter space to find the combination of $\beta_0, \beta_1, \ldots, \beta_k$ that produces the maximum of the likelihood function, similarly to how we searched the space of linear coefficients to find the combination that minimizes the squared error in Simple Linear Regression.

So what combination of the linear coefficients is the best one?

It is the one which gives the Maximum Likelihood. This approach, known as Maximum Likelihood Estimation (MLE), stands behind many important statistical learning models. It is the cornerstone of Statistical Estimation Theory. It is contrasted with the Least Squares Estimation that we used earlier to estimate Simple and Multiple Linear Regression models.

Now, there is a technical problem related to this approach. To obtain the likelihood for the whole dataset one needs to multiply as many very small numbers as there are data points (because each $p_1 < 1$, and often $p_1 \ll 1$). That can cause numerical underflow: the products quickly drop below the smallest real numbers that can be represented by digital computers. The workaround is to use the logarithm of the likelihood instead, known as the Log-Likelihood ($LL$).

Thus, while the Likelihood function for the whole dataset would be

$$\mathcal{L}(p|Y) = \prod_{i=1}^{n}p_1^{y_i}(1-p_1)^{\bar{y_i}}$$

the Log-Likelihood function would be:

$$LL(p|Y) = \sum_{i=1}^{n} y_i\log(p_1)+\bar{y_i}\log(1-p_1)$$

And finally here is how we solve the Binomial Logistic Regression problem:

Technically, in optimization we do not go for the maximum of the Likelihood function directly: we use $LL$ instead of $\mathcal{L}(p|Y)$, and numerical optimizers conventionally minimize rather than maximize. The solution is to minimize the negative $LL$, sometimes written simply as $NLL$, the Negative Log-Likelihood function.

Implement the model's predictive pass given the parameters; the following blr_predict() function is nothing other than an implementation of the following expression:

$$P(Y) = p_1 = \frac{1}{1+e^{-(\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_kX_k)}}$$
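A minimal implementation consistent with the expression above (the original notebook cell is not reproduced here; the convention of passing the intercept as the first coefficient is ours):

```python
import numpy as np

def blr_predict(X, betas):
    """Predictive pass of Binomial Logistic Regression.

    X     : (n, k) feature matrix
    betas : array of k + 1 coefficients, intercept first
    """
    linear = betas[0] + X @ betas[1:]      # beta_0 + beta_1*X_1 + ... + beta_k*X_k
    return 1.0 / (1.0 + np.exp(-linear))   # inverse logit -> p_1

# Quick check: with all-zero coefficients every prediction is 0.5
X = np.array([[1.0, 2.0], [3.0, 4.0]])
print(blr_predict(X, np.zeros(3)))  # [0.5 0.5]
```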

Test blr_predict()

Now define the Negative Log-Likelihood function:
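A sketch of how such a blr_nll() might look, following the $LL$ formula above (the clipping constant eps is our addition, guarding against $\log(0)$):

```python
import numpy as np

def blr_nll(betas, X, y, eps=1e-12):
    """Negative Log-Likelihood of a Binomial Logistic Regression model."""
    p1 = 1.0 / (1.0 + np.exp(-(betas[0] + X @ betas[1:])))
    p1 = np.clip(p1, eps, 1.0 - eps)  # keep log() finite
    ll = np.sum(y * np.log(p1) + (1 - y) * np.log(1 - p1))
    return -ll

# Quick check: all-zero coefficients give p1 = 0.5, so NLL = n * log(2)
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1, 0])
print(blr_nll(np.zeros(3), X, y))  # 2 * log(2) ≈ 1.386
```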

Test blr_nll():

Optimize!
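As a sketch of the optimization step, scipy.optimize.minimize can minimize the $NLL$; the synthetic data, true coefficients, and starting point below are all ours:

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic data generated from known coefficients (intercept first)
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
true_betas = np.array([-0.5, 1.0, 2.0])
p = 1.0 / (1.0 + np.exp(-(true_betas[0] + X @ true_betas[1:])))
y = rng.binomial(1, p)

def blr_nll(betas, X, y, eps=1e-12):
    p1 = 1.0 / (1.0 + np.exp(-(betas[0] + X @ betas[1:])))
    p1 = np.clip(p1, eps, 1.0 - eps)
    return -np.sum(y * np.log(p1) + (1 - y) * np.log(1 - p1))

result = minimize(blr_nll, x0=np.zeros(3), args=(X, y), method="BFGS")
print(result.x)  # should land close to true_betas
```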

Check against statsmodels

Plot the Model Log-Likelihood Function

4. Regularization of BLR

We will regularize the BLR model directly in scikit-learn.

It is recommended to standardize the feature matrix $X$ before using regularized logistic regression in scikit-learn, particularly if the features have different scales. Regularization techniques like $L1$ and $L2$ regularization are sensitive to the scale of the features, and features with larger scales may dominate the regularization penalty. Standardizing the features can help to mitigate this issue and ensure that all features contribute equally to the model.

To standardize the features in scikit-learn, we use the StandardScaler transformer from the sklearn.preprocessing module:
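A minimal StandardScaler sketch on made-up numbers; after the transformation every column has mean 0 and standard deviation 1:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (made-up values)
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

print(X_std.mean(axis=0))  # ~[0, 0]
print(X_std.std(axis=0))   # [1, 1]
```

In practice, fit the scaler on the training data only and apply the same transformation to the test data, to avoid information leaking from the test set.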

LASSO (L1) Regularization

Please make sure to study the LogisticRegression() documentation from Scikit-Learn: sklearn.linear_model.LogisticRegression:

C float, default=1.0

Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.

Examine a range of values for the inverse regularization strength C:
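A sketch of such a sweep on synthetic, standardized data (the dataset and the particular C grid are ours): with the L1 penalty, decreasing C (stronger regularization) drives more coefficients to exactly zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8,
                           n_informative=3, random_state=42)
X = StandardScaler().fit_transform(X)

n_zero = {}
for C in [1.0, 0.1, 0.01]:
    # liblinear supports the L1 penalty for logistic regression
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)
    n_zero[C] = int(np.sum(model.coef_ == 0))
    print(f"C = {C}: {n_zero[C]} coefficient(s) exactly zero")
```

This sparsity is what makes LASSO useful for feature selection.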

Ridge (L2) Regularization

Examine a range of values for the inverse regularization strength C:

ElasticNet Regularization

l1_ratio float, default=None

The Elastic-Net mixing parameter, with 0 <= l1_ratio <= 1. Only used if penalty='elasticnet'. Setting l1_ratio=0 is equivalent to using penalty='l2', while setting l1_ratio=1 is equivalent to using penalty='l1'. For 0 < l1_ratio <1, the penalty is a combination of L1 and L2.
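A sketch of an Elastic-Net penalized BLR on synthetic, standardized data (the dataset and l1_ratio value are ours); note that penalty='elasticnet' requires the saga solver:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X = StandardScaler().fit_transform(X)

# l1_ratio=0.5 mixes the L1 and L2 penalties equally
model = LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.5, C=1.0, max_iter=5000)
model.fit(X, y)
print(model.coef_)
```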

Now, using F1 to score the model:

NOTE the following:

Oops. Class imbalance problems? Use the class_weight argument to LogisticRegression():

class_weight dict or ‘balanced’, default=None

Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one.

The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
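A sketch of the effect on an imbalanced synthetic dataset (the 90/10 class split and the seed are ours): class_weight='balanced' typically trades some overall accuracy for better minority-class recall.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# ~90% class 0, ~10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=42)

plain = LogisticRegression(max_iter=1000).fit(X, y)
balanced = LogisticRegression(class_weight="balanced",
                              max_iter=1000).fit(X, y)

recall_plain = recall_score(y, plain.predict(X))        # minority-class recall
recall_balanced = recall_score(y, balanced.predict(X))  # usually higher
print(recall_plain, recall_balanced)
```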

The class imbalance problem can severely affect your model's ROC analysis; in particular, Accuracy as an indicator of model fit makes little sense when class imbalance is present in the data.

Further Reading


DataKolektiv, 2022/23.

hello@datakolektiv.com

License: [GPLv3](https://www.gnu.org/licenses/gpl-3.0.txt) This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.